Mining Constant Conditional Functional Dependencies for Improving Data Quality

Authors

  • Devi Kalyani
  • W. Fan
  • F. Geerts
  • X. Jia
  • G. Cong
  • Wenfei Fan
  • Floris Geerts
  • Jianzhong Li
  • Philip Bohannon
  • Xibei Jia
  • Anastasios Kementsietsidis
  • Nicolas Pasquier
  • Yves Bastide
  • Rafik Taouil
  • Lotfi Lakhal
  • Paul De Bra
  • Rakesh Agrawal
  • RamaKrishnan Srikant
Abstract

This paper applies data mining techniques to data cleaning and shows that they are effective in discovering constant conditional functional dependencies (CCFDs) from relational databases. These CCFDs are used as business rules for context-dependent data validation. Conditional functional dependencies (CFDs) are an extension of functional dependencies (FDs) that captures the consistency of data by supporting patterns of semantically related constants. FDs, CFDs and association rules form a hierarchy: a union of association rules is a CFD, while a union of CFDs is an FD. Based on this hierarchy, the paper proposes reusing association rule discovery algorithms for CCFD mining, i.e. mining CFDs with constant patterns only. Three algorithms for CCFD mining are provided: CCFD-FPGrowth, CCFD-AprioriClose and CCFD-ZartMNR. CCFD-FPGrowth uses the FP-growth algorithm to find frequent itemsets and then generates rules as constant patterns from the set of frequent itemsets using a modified version of Agrawal's association rule generation algorithm. CCFD-AprioriClose uses the Apriori algorithm to find frequent closed itemsets and then generates rules as constant patterns from the set of frequent closed itemsets using the same modified rule generation algorithm. CCFD-ZartMNR uses the Zart algorithm to find closed itemsets and minimal generators and then generates minimal non-redundant rules from the set of closed itemsets. Experimental results on two real-world data sets show that this approach performs well across several dimensions such as recall, runtime and scalability.
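The link between constant CFDs and association rules described in the abstract can be illustrated with a minimal Python sketch. It is not any of the paper's three algorithms: it replaces FP-growth, Apriori and Zart with brute-force itemset counting, and it treats a constant CFD as an association rule over (attribute, value) items that holds with confidence 1 and meets a support threshold. The function name `mine_constant_cfds`, the parameters `min_support` and `max_lhs`, and the sample relation are all hypothetical, and attribute values are assumed to be strings.

```python
from collections import Counter
from itertools import combinations


def mine_constant_cfds(rows, min_support=2, max_lhs=2):
    """Return (lhs_items, rhs_item, support) rules that hold with confidence 1.

    Illustrative brute force only; the paper's algorithms use FP-growth,
    Apriori or Zart instead of exhaustive counting.
    """
    # Each tuple of the relation becomes a set of (attribute, value) items.
    transactions = [frozenset(row.items()) for row in rows]

    # Count every itemset with 1 .. max_lhs + 1 items.
    counts = Counter()
    for t in transactions:
        for size in range(1, max_lhs + 2):
            for itemset in combinations(sorted(t), size):
                counts[frozenset(itemset)] += 1

    rules = []
    for itemset, support in counts.items():
        if support < min_support or len(itemset) < 2:
            continue
        for rhs in itemset:
            lhs = itemset - {rhs}
            # Confidence 1: every tuple matching lhs also matches rhs.
            if counts[lhs] == support:
                rules.append((tuple(sorted(lhs)), rhs, support))
    return sorted(rules)


if __name__ == "__main__":
    # Hypothetical sample relation (assumed data, not from the paper).
    sample = [
        {"CC": "44", "AC": "131", "city": "EDI"},
        {"CC": "44", "AC": "131", "city": "EDI"},
        {"CC": "01", "AC": "908", "city": "MH"},
        {"CC": "44", "AC": "020", "city": "LDN"},
    ]
    for lhs, rhs, support in mine_constant_cfds(sample):
        print(lhs, "->", rhs, "support", support)
```

On the sample relation the sketch also emits rules whose left-hand side is larger than necessary, for example one with both `AC = 131` and `city = EDI` on the left even though `AC = 131` alone already implies `CC = 44`. Redundancy of this kind is what the minimal non-redundant rules of CCFD-ZartMNR are meant to avoid.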


Similar resources

Approximation Measures for Conditional Functional Dependencies Using Stripped Conditional Partitions

Received Apr 11, 2017; revised May 5, 2017; accepted May 24, 2017. Conditional functional dependencies (CFDs) have been used to improve the quality of data, including detecting and repairing data inconsistencies. Approximation measures have significant importance for data dependencies in data mining. To adapt to exceptions in real data, the measures are used to relax the strictness of CFDs for mor...


Discovering (frequent) constant conditional functional dependencies

Conditional functional dependencies (CFDs) have recently been introduced in the context of data cleaning. They can be seen as a unification of functional dependencies (FDs) and association rules (ARs), since they allow attributes and attribute/value pairs to be mixed in dependencies. In this paper, we introduce our first results on constant CFD inference. Not surprisingly, data mining techniques developed...


Conditional Dependencies: A Principled Approach to Improving Data Quality

Real-life data is often dirty and costs businesses worldwide billions of pounds each year. This paper presents a promising approach to improving data quality. It effectively detects and fixes inconsistencies in real-life data based on conditional dependencies, an extension of database dependencies that enforces bindings of semantically related data values. It accurately identifies records fro...


Defining and Mining Functional Dependencies in Probabilistic Databases

Functional dependencies (traditional, approximate and conditional) are of critical importance in relational databases, as they inform us about the relationships between attributes. They are useful in schema normalization, data rectification and source selection. Most of these, however, were developed in the context of deterministic data. Although uncertain databases have started receiving attenti...


Discovering Data Quality Rules in a Master Data Management

Dirty data continues to be an important issue for companies. The Data Warehousing Institute [Eckerson, 2002], [Rockwell, 2012] stated that poor data costs US businesses $611 billion annually and that erroneously priced data in retail databases costs US customers $2.5 billion each year. Data quality is becoming more and more critical. The database community pays particular attention to this subject whe...




Publication date: 2013